D3Events Data Sharing Considerations

Overview

The purpose of this draft Charter is to identify the steps that will be required to develop the D3Events (network monitoring) capability. This draft outline is internal to the MS SIG for comment and further development.

Glossary

TBC:

Term - Interpretation
Contributor - A person who permits the collection of data from IoT devices under their control.
Researcher - A person (or team) who is authorised to access the contributed D3Events data in order to analyse the traffic flows and to develop models of 'normal' traffic flows on an IoT network.
Service Provider - An organisation that receives contributed D3Events data in order to provide a ManySecured service to a customer.

A critical precursor is to determine what information will be gathered, stored and shared more widely. This can be divided between:

  • Personal details of Contributors/users (name, contact details, address, etc.) - used for account management and queries, but not intended for public sharing. Will data hosting and processing be subject to EU GDPR constraints, UK 2018 Data Protection Act (under review) or other legislation? International transfers may need to meet adequacy agreements.
  • Data harvesting (network traffic, device types, certificates, etc.) - used for analysis and training machine learning models. ManySecured cannot guarantee that a user's identity, (general) location and other personal traffic (e.g.: www traffic, such as banking and web-sites visited) will not become identifiable, especially when combined with other sources of data/traffic. Any terms and conditions will need to assume that the data may become attributable to a contributor, maybe not by name but by network and general location.
  • Data sharing - does the collected data need to be shared in order to avoid breaching competition law (probably not - TBC)? If not, what licence will be used for accessing the data lake and what limitations will be put in place (e.g.: SIG members only, any exclusions, such as country)?

Terms and Conditions will need to be written for Contributors, Researchers and Service Providers. Because of the liabilities for breaches of personal data (fines of up to 4% of annual worldwide turnover under GDPR), a working assumption must be that Contributors accept that all data on their IoT networks may become public and that they will not hold ManySecured liable for any breach.

The RIPE Atlas project, established in 2010, has already addressed some of these issues. Personal data held in the RIPE Database is available to the public, although use of the RIPE Database is subject to terms and conditions intended to prevent mass mining of data.

ManySecured may need to provide advisory guidelines to contributors as to how they could segregate the data that they are willing to share (from IoT devices) from data that they wish to remain (more) private (e.g.: IT/laptop/tablet browsing, banking, web-cams, 3rd party data - such as CCTV, etc.).

Data Gathering Protocols

Activities may include:

  • Defining the data types to be collected and the reporting schema. This is expected to evolve over the development of the ManySecured specifications. An initial certificate harvesting specification is available at https://github.com/TechWorksHub/ManySecured-SUIB/blob/main/Certificate%20harvesting%20spec.md, but its purpose is limited to understanding how certificates are currently being used on local networks, as this is required for SUIB.
  • Anonymisation of Contributor data. This is notoriously difficult to achieve more than superficially, especially if device identifiers appear on more than one network (e.g.: devices are taken from home to work environments) or public IP addresses are captured. In an industrial setting, trade secrets (pressures, temperatures, flow rates) would need to be sanitised before uploading.
  • Identifying existing tools, or developing new tools, for automating the collection (see the RIPE Atlas probe and anchor as examples of existing probes). This may need to include collection of non-IP traffic (e.g.: Zigbee, USB) in order to identify anomalous behaviour (why is a smart toaster communicating with the door locks?).
  • Assess the volumes and variety of data sources needed to perform analysis and to reliably train ML (machine-learning) models.
  • Identifying contributors willing to share their data.

Data Storage and Security

ML algorithms benefit from large volumes of data during the training phase, and potentially the MS data lake could become very large during the development phase. Hosting of this information will need to take account of:

  • Jurisdiction and data protection laws, and how these may encourage/discourage potential contributors (for example, EU GDPR imposes much stricter requirements for protecting user data than many other jurisdictions).
  • Reliability - what level of service is required (likely to be a trade-off against price) and what is the impact of the data store being unavailable for short or extended periods of time?
  • Access controls - for both contributors to see/delete their own data and for analysis by authorised stakeholders.
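
The access-control requirement above could be sketched as follows. This is a hypothetical illustration only: the role names are taken from the Glossary, but the `User` type, the `is_allowed` function and the rule that Researchers and Service Providers get read-only access are assumptions, not agreed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: str
    role: str  # "contributor", "researcher" or "service_provider"


def is_allowed(user: User, action: str, record_owner_id: str) -> bool:
    """Return True if the user may perform the action on a record."""
    if user.role == "contributor":
        # Contributors may see or delete only the data they contributed.
        return action in {"read", "delete"} and user.user_id == record_owner_id
    if user.role in {"researcher", "service_provider"}:
        # Assumed: authorised stakeholders may analyse (read) but not delete.
        return action == "read"
    # Unknown roles are denied by default.
    return False
```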

Recruitment of Contributors

This activity will need to start in parallel with all the other activities, but it is likely to be dependent upon many of the other issues being sufficiently mature (e.g. T&Cs, data gathering tools and protocols).

Data Sharing

Who will be authorised to access the data lake, during the development phase and then during the in-service phase?

How long will data be retained and 'live' on the system? For example, data older than 6 months may be deleted from the service provider data lake, but archived for several years in case there is any need for forensics or legal challenge.
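
The retention example above can be expressed as a simple rule. The sketch below is illustrative only: the 6-month live period comes from the example, while the 3-year archive period, the "expired" state and the function name are placeholder assumptions standing in for "several years" and whatever happens afterwards.

```python
from datetime import datetime, timedelta, timezone

LIVE_PERIOD = timedelta(days=182)          # roughly 6 months
ARCHIVE_PERIOD = timedelta(days=3 * 365)   # assumed value for "several years"


def retention_state(collected_at: datetime, now: datetime) -> str:
    """Classify a record as 'live', 'archived' or 'expired' by its age."""
    age = now - collected_at
    if age <= LIVE_PERIOD:
        return "live"       # still in the service provider data lake
    if age <= ARCHIVE_PERIOD:
        return "archived"   # held offline for forensics or legal challenge
    return "expired"        # assumed: eligible for permanent deletion


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(retention_state(now - timedelta(days=30), now))
```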

How is semi-anonymised data tagged to avoid bulk data identification of individuals and their pattern of life?

Outstanding work (To-Do List)

  1. Peer review by MS SIG and potential Contributors.